[megatron] don't re-assert no_sync_func every step with overlap_grad_reduce by HaozheZhang6 · Pull Request #2066 · THUDM/slime

HaozheZhang6 · 2026-06-12T06:30:11Z

Summary

train() in slime/backends/megatron_utils/model.py sets up config.no_sync_func on every step, but config = get_model_config(model[0]) is the model config and persists across steps. With --overlap-grad-reduce, the first step passes assert config.no_sync_func is None and sets it; the second step then trips that same assert and crashes.

assert config.no_sync_func is None, (...)   # step 1: None -> ok, sets it
                                            # step 2: not None -> AssertionError

Fix

Guard the setup with if config.no_sync_func is None: so the sync funcs are set once. They're constant (the model chunks' own no_sync / start_grad_sync), so skipping on later steps is a no-op — the only thing the per-step re-run did was trip the assert.

Reproduction

Run any training with --overlap-grad-reduce (from #1779) — it crashes on the 2nd step without this change.

Notes

I couldn't add an automated test: reproducing needs a multi-GPU Megatron training run with --overlap-grad-reduce, which I can't stand up locally. The fix is small and verified by inspection against the reported repro; happy to add a test if you can point me at a lightweight harness for this path.
Opening as a draft to confirm the approach: the original assert also doubled as a guard against a pre-supplied custom no_sync_func. Under the is None guard it becomes redundant, and a hypothetical pre-set custom func would now be left as-is rather than rejected. If you'd rather keep that rejection explicit (e.g. assert only on first setup), I'll adjust.

Fixes #1779

…reduce `train()` sets up `config.no_sync_func` on every step, but `config` is the model config and persists across steps. With `--overlap-grad-reduce` the first step sets it, then the second step trips `assert config.no_sync_func is None` and crashes. Guard the setup with `if config.no_sync_func is None:` so the sync funcs are set once (they are constant, so skipping later steps is a no-op). Fixes THUDM#1779

…potency test

HaozheZhang6 · 2026-06-20T19:24:13Z

Added a unit test for the idempotency (tests/test_overlap_grad_reduce.py). Pulled the setup into a small grad_reduce.py helper so it's testable without a full training step -- it stubs megatron's DDP the same way the other CPU tests do. Also ran it on a single GPU with a real megatron DDP: the 2nd step trips the assert without the guard, not with it.

…duce-no-sync-assert # Conflicts: # slime/backends/megatron_utils/model.py

HaozheZhang6 added 2 commits June 11, 2026 23:29

[megatron] extract overlap_grad_reduce setup into a helper + add idem…

f34341c

…potency test

HaozheZhang6 marked this pull request as ready for review June 20, 2026 19:24

Merge remote-tracking branch 'upstream/main' into fix/overlap-grad-re…

5d2c99e

…duce-no-sync-assert # Conflicts: # slime/backends/megatron_utils/model.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[megatron] don't re-assert no_sync_func every step with overlap_grad_reduce#2066

[megatron] don't re-assert no_sync_func every step with overlap_grad_reduce#2066
HaozheZhang6 wants to merge 3 commits into
THUDM:mainfrom
HaozheZhang6:fix/overlap-grad-reduce-no-sync-assert

HaozheZhang6 commented Jun 12, 2026

Uh oh!

HaozheZhang6 commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

HaozheZhang6 commented Jun 12, 2026

Summary

Fix

Reproduction

Notes

Uh oh!

HaozheZhang6 commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant